Web mining
Tools
Python
https://www.youtube.com/watch?v=ind-mugxMxk
- Beautiful Soup (Python)
- Mechanize (Python)
- Twill (Python)
- http://github.com/petewarden/pyparallelcurl - A simple Python class for running multiple URL fetches in parallel
- pattern - “It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks).”
- http://scrapy.org/
- lazynlp: Library to scrape and clean web pages to create massive datasets
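
Several of the tools above are about fetching many pages at once. As a rough illustration of that idea, here is a minimal sketch that uses only the Python standard library. It is not pyparallelcurl's API and does not use curl: it swaps in a thread pool, and the names `fetch`, `fetch_all`, and `max_workers=8` are choices made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes (real network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch URLs concurrently; map each URL to its result or its exception."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every URL first so the downloads overlap.
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one URL fails
                results[url] = exc
    return results
```

Passing `fetch_one` as a parameter makes it possible to try the function with a fake fetcher before pointing it at a real site. Keep `max_workers` small when all URLs are on the same host.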
R
Ruby
Web
Browser plugins
- iMacros
- Selenium
- https://chromewebstore.google.com/detail/easy-scraper-one-click-we/cljbfnedccphacfneigoegkiieckjndh?pli=1
Etc.
Tutorials
- National Assembly website, crawling the list of National Assembly members - how to use the browser's developer tools to find a page's asynchronous requests and crawl it without tools like Selenium.
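
The approach in that tutorial can be sketched roughly as follows: once the browser's Network tab shows which asynchronous request fills the page, send that request directly and parse the response. Everything specific here is an assumption made for illustration: the endpoint `https://example.org/api/members`, the `page` and `size` parameters, and the JSON shape with `items`, `name`, and `party` are hypothetical and are not taken from the actual site.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace it with the request URL that the
# browser's Network tab shows for the page you are studying.
API_URL = "https://example.org/api/members"


def build_request(page, page_size=50):
    """Build a request that mimics the one the page's JavaScript sends."""
    # Parameter names are assumptions; copy the real ones from the Network tab.
    query = urllib.parse.urlencode({"page": page, "size": page_size})
    return urllib.request.Request(
        f"{API_URL}?{query}",
        headers={
            "Accept": "application/json",
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def parse_members(body):
    """Extract (name, party) pairs from a JSON body with the assumed shape."""
    data = json.loads(body)
    return [(item["name"], item["party"]) for item in data["items"]]


def fetch_page(page):
    """Fetch and parse one page of results (performs a real network call)."""
    with urllib.request.urlopen(build_request(page), timeout=10) as resp:
        return parse_members(resp.read().decode("utf-8"))
```

Because the data arrives as JSON, there is no HTML to parse and no browser to automate, which is usually faster and less fragile than driving a browser.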
Articles
- The Perils of Web Crawling
- http://www.bytemining.com/2011/02/web-mining-pitfalls/
- http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data
- http://vancouverdata.blogspot.com/2011/02/how-to-web-scraping-xpath-html-google.html
- Web scraping 101 with python
- How to Crawl the Web Politely with Scrapy
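
As a companion to the articles on crawling pitfalls and politeness, here is a minimal sketch of two common courtesy measures: obeying robots.txt and pausing between requests. It uses the standard library's `urllib.robotparser` rather than Scrapy, and the user agent name, the one-second default delay, and the helper names are assumptions made for this example.

```python
import time
import urllib.robotparser

USER_AGENT = "example-research-bot"  # hypothetical crawler name


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule object."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [u for u in urls if parser.can_fetch(USER_AGENT, u)]


def crawl_delay(parser, default=1.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def crawl(parser, urls, fetch, sleep=time.sleep):
    """Fetch allowed URLs one at a time, pausing after each request."""
    delay = crawl_delay(parser)
    results = []
    for url in allowed_urls(parser, urls):
        results.append(fetch(url))
        sleep(delay)
    return results
```

`fetch` is any function that takes a URL and returns its content, so the crawl loop can be exercised with a stub before it touches a real server.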